[Bug] Update resource failures w/ Finalizers set (#423) #5673
## Overview

When the [informer cache has stale values](https://github.com/unionai/flyte/blob/1e82352dd95f89630e333fe6105d5fdb5487a24e/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go#L478), we cannot update the k8s resource when [clearing finalizers](https://github.com/unionai/flyte/blob/1e82352dd95f89630e333fe6105d5fdb5487a24e/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go#L450) and get `Error: Operation cannot be fulfilled on pods.` The current implementation bubbles the error up, resulting in a system retry. By the next loop, the informer cache is up to date and the update can be applied. However, in an ArrayNode with many subnodes executing in parallel, the execution can easily run out of retries. This update adds a basic retry with exponential backoff to wait for the informer cache to catch up; see the sketch following this description.

## Test Plan

Ran in dogfood-gcp (https://buildkite.com/unionai/managed-cluster-staging-sync/builds/4622) and manually updated the configmap to enable finalizers.

- Run without the change: https://dogfood-gcp.cloud-staging.union.ai/console/projects/flytesnacks/domains/development/executions/fd16ac81fd7b5480fb6f/nodes
- Run with the change: https://dogfood-gcp.cloud-staging.union.ai/console/projects/flytesnacks/domains/development/executions/f016a3be7fa304db5a77/nodeId/n0/nodes

Confirmed in the logs that conflict errors such as:

```
{"json":{"exec_id":"f016a3be7fa304db5a77","node":"n0/n42","ns":"development","res_ver":"146129599","routine":"worker-66","src":"plugin_manager.go:455","wf":"flytesnacks:development:tests.flytekit.integration.map_task_issue.wf8"},"level":"warning","msg":"Failed to clear finalizers for Resource with name: development/f016a3be7fa304db5a77-n0-0-n42-0. Error: Operation cannot be fulfilled on pods \"f016a3be7fa304db5a77-n0-0-n42-0\": the object has been modified; please apply your changes to the latest version and try again","ts":"2024-08-17T02:02:48Z"}
```

did not bubble up, and confirmed the finalizers were removed:

```
➜ ~ k get pods -n development f016a3be7fa304db5a77-n0-0-n42-0 -o json | grep -i final
INFO[0000] [0] Couldn't find a config file []. Relying on env vars and pflags.
➜ ~
```

## Rollout Plan (if applicable)

- Revert the changes to the customer's config that disabled finalizers.

## Upstream Changes

Should this change be upstreamed to OSS (flyteorg/flyte)? If not, please uncheck this box, which is used for auditing. Note, it is the responsibility of each developer to actually upstream their changes. See [this guide](https://unionai.atlassian.net/wiki/spaces/ENG/pages/447610883/Flyte+-+Union+Cloud+Development+Runbook/#When-are-versions-updated%3F).

- [x] To be upstreamed to OSS

## Issue

fixes: https://linear.app/unionai/issue/COR-1558/investigate-why-finalizers-consume-system-retries-in-map-tasks

## Checklist

* [ ] Added tests
* [x] Ran a deploy dry run and shared the terraform plan
* [ ] Added logging and metrics
* [ ] Updated [dashboards](https://unionai.grafana.net/dashboards) and [alerts](https://unionai.grafana.net/alerting/list)
* [ ] Updated documentation

Signed-off-by: Paul Dittamo <[email protected]>
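For reference, here is a minimal sketch of the retry-on-conflict pattern described above, using client-go's `retry` and `wait` helpers. The function name `clearFinalizersWithRetry`, the backoff values, and the use of a plain `kubernetes.Interface` client are illustrative assumptions; the actual change lives in flytepropeller's `plugin_manager.go` and may differ in client type and tuning.

```go
// Minimal sketch: retry a finalizer-clearing update with exponential
// backoff while the API server reports a resourceVersion conflict,
// i.e. while the informer cache is behind the latest object state.
// Names and backoff values are illustrative, not the exact PR code.
package k8s

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// clearFinalizersWithRetry (hypothetical helper) keeps retrying the update
// with exponential backoff for as long as the error is a conflict, giving
// the informer cache time to catch up instead of consuming a system retry.
func clearFinalizersWithRetry(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond, // first retry delay (illustrative)
		Factor:   2.0,                    // double the delay on each attempt
		Jitter:   0.1,                    // randomize delays to avoid thundering herds
		Steps:    5,                      // give up after a bounded number of attempts
	}
	return retry.OnError(backoff, apierrors.IsConflict, func() error {
		// Fetch the latest version directly from the API server on each
		// attempt rather than trusting the (possibly stale) informer cache.
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		pod.SetFinalizers(nil)
		_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```

client-go also ships `retry.RetryOnConflict`, which hard-codes the conflict predicate; the explicit `retry.OnError` form is shown here to make both the backoff parameters and the retried condition visible.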
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #5673      +/-   ##
===========================================
+ Coverage    9.74%   36.17%   +26.42%
===========================================
  Files         214     1302     +1088
  Lines       39190   109513    +70323
===========================================
+ Hits         3820    39611    +35791
- Misses      35031    65765    +30734
- Partials      339     4137     +3798
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Upstreamed from https://github.com/unionai/flyte/pull/423